QTM 447 Lecture 16: Recurrence and Sequence Modeling

Kevin McAlister

March 6, 2025

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

“Feedforward” Neural Networks

Thus far, we have only seen deep learning architectures where the information from the inputs flows upwards through the model

  • Inputs go through a layer and yield hidden values

  • Hidden values to more hidden values

Feedforward in the sense that the only direct influence of a layer is on its child

  • In computational graphs, there is no complicated structure that creates feedback between units

“Feedforward” Neural Networks

For tabular data and image analysis:

  • A single input relates to a single output

    • Classification \(\rightarrow\) Class

    • Image \(\rightarrow\) Segmentation map (a semantic class for each pixel of an \(H \times W\) image)

  • A single input leads to two (or more) independent outputs

    • Regression \(\rightarrow\) (Mean, Variance)

    • Bounding Box \(\rightarrow\) (Class, Location)

“Feedforward” Neural Networks

  • Image classification (One image to One Class)

Recurrent Neural Networks

  • Image captioning (One image to a sequence of words)

  • Other examples?

Recurrent Neural Networks

  • Video classification (Many images/frames to one video class)

  • Other examples?

Recurrent Neural Networks

Not necessarily equal number of inputs and outputs

  • Machine translation (Sequence of words in English to Seq in French)

  • Sentence continuation (First part of sentence to second part of sentence)

  • Time series (Stock Prices for the last 10 years to Stock Prices for the next week)

Recurrent Neural Networks

Equal number of outputs and inputs

  • Per-frame video classification (Seq of frames to Seq of class labels)

  • Per-sentence sentiment analysis (Seq of sentences to Seq of sentiment classes)

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are the backbone of sequence modeling

  • Inputs or outputs make sense temporally

  • Knowing what came before or what comes after should inform what we predict at a current point in time

Adds knowledge to the NN architecture

  • Units need to talk backwards in time - all of the previous values need to influence what I guess next!

  • Note, we’re going to use time pretty loosely here - just some sort of meaningful temporal flow

If the order of the input matters, then we should preserve that information!

Recurrent Neural Networks

For now, let’s think in terms of the many-to-many framework:

  • A sequence of input vectors, \(\mathbf X = \{\mathbf x_1, \mathbf x_2,...,\mathbf x_T \}\), where each vector is of length \(P\)

  • A sequence of output vectors, \(\mathbf Y = \{\mathbf y_1, \mathbf y_2,...,\mathbf y_T\}\), where each vector is of length \(G\)

Goal: Use \(\mathbf X\) to predict \(\mathbf Y\)

Example: Predict the next character

\[ \mathbf X = [[h],[e],[l],[l]] \]

\[ \mathbf Y = [[e],[l],[l],[o]] \]
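As a minimal sketch (the vocabulary ordering here is an illustrative assumption), the character sequences above can be one-hot encoded in NumPy:

```python
import numpy as np

# Hypothetical 4-character vocabulary for the "hello" example
vocab = ["h", "e", "l", "o"]
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(char):
    """Encode a single character as a one-hot vector of length P = 4."""
    v = np.zeros(len(vocab))
    v[char_to_idx[char]] = 1.0
    return v

# Input sequence "hell" and target sequence "ello" (shifted by one step)
X = np.stack([one_hot(c) for c in "hell"])  # shape (T, P) = (4, 4)
Y = np.stack([one_hot(c) for c in "ello"])
```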

Recurrent Neural Networks

We can process a sequence of vectors, \(\mathbf x\), by applying a recurrence formula at every time step:

\[ h_t = f_w(h_{t - 1}, x_t) \]

  • \(h_{t - 1}\) is the old state (some vector of numbers)

  • \(x_t\) is the input state at time \(t\) (some vector)

  • \(f_w()\) is a function of some parameters, \(\mathbf W\) (some collection of vectors)

  • \(h_t\) is the new state (some vector)
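A minimal NumPy sketch of this recurrence (the toy sizes and the tanh-based \(f_w\) are assumptions); the key point is that the same function and weights are reused at every step:

```python
import numpy as np

def f_w(h_prev, x_t, W_hh, W_xh):
    """One recurrence step: new state from old state and current input."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t)

rng = np.random.default_rng(0)
m, P, T = 3, 4, 5                      # hidden size, input size, sequence length
W_hh = 0.1 * rng.normal(size=(m, m))
W_xh = 0.1 * rng.normal(size=(m, P))

h = np.zeros(m)                        # initial state h_0
for x_t in rng.normal(size=(T, P)):
    h = f_w(h, x_t, W_hh, W_xh)        # same function and weights every step
```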

Recurrent Neural Networks

The hidden state vector contains latent information about the current state of the model

  • For language models, it could tell us about the gender of pronouns in the sentence

  • Or the passiveness of the tone

This information should persist throughout the sequence of predictions!

  • Create memory from one prediction to the next.

Recurrent Neural Networks

Start with an initial state, \(\mathbf h_0\)

Recurrent Neural Networks

Going from \(\mathbf h_0\) to \(\mathbf h_1\):

\[ h_1 = f_w(h_{0}, x_1) \]

  • The action here happens in how we relate the hidden states!

Recurrent Neural Networks

The vanilla update:

\[ \underset{(m \times 1)}{\mathbf h_t} = \text{tanh}(\underset{(m \times m)}{\mathbf W_{hh}} \underset{(m \times 1)}{\mathbf h_{t-1}} + \underset{(m \times P)}{\mathbf W_{xh}} \underset{(P \times 1)}{\mathbf x_t} + \underset{(m \times 1)}{\mathbf b_h}) \]

  • \(\mathbf h_t\) and \(\mathbf h_{t-1}\) are hidden vectors of size \(m\)

  • \(\mathbf W_{hh}\) is a \(m \times m\) matrix of weights that controls the mapping of one hidden state vector to the next

  • \(\mathbf W_{xh}\) is a \(m \times P\) matrix of weights that controls the mapping of the input to the hidden state

  • \(\mathbf b_h\) is a vector of bias terms

  • \(\text{tanh}()\) is an activation function
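A one-step NumPy sketch of the vanilla update (sizes are toy assumptions), checking that the shapes line up as annotated above:

```python
import numpy as np

rng = np.random.default_rng(1)
m, P = 5, 3                            # hidden size m, input size P (toy values)
W_hh = 0.1 * rng.normal(size=(m, m))   # (m x m): hidden-to-hidden weights
W_xh = 0.1 * rng.normal(size=(m, P))   # (m x P): input-to-hidden weights
b_h = np.zeros(m)                      # (m x 1): bias terms

h_prev = np.zeros(m)                   # h_{t-1}
x_t = rng.normal(size=P)

# h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
```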

Recurrent Neural Networks

Known values:

  • \(\mathbf x_t\)

Unknown values to be learned:

  • \(\mathbf h_t\) and \(\mathbf h_{t - 1}\)

  • \(\mathbf W_{hh}\), \(\mathbf W_{xh}\), and \(\mathbf b_h\)

Recurrent Neural Networks

The vanilla update:

\[ \underset{(m \times 1)}{\mathbf h_t} = \text{tanh}(\underset{(m \times m)}{\mathbf W_{hh}} \underset{(m \times 1)}{\mathbf h_{t-1}} + \underset{(m \times P)}{\mathbf W_{xh}} \underset{(P \times 1)}{\mathbf x_t} + \underset{(m \times 1)}{\mathbf b_h}) \]

Drawing some parallels to what we’ve seen before:

  • \(\mathbf h_t\) is like a higher level of hidden values that is a function of the previous set of hidden values passed through an activation function

  • \(\mathbf W_{hh}\) is the matrix of hidden weights that maps these two hidden values

  • \(\mathbf W_{xh}\) is a matrix of weights that grounds the new hidden state to its input

A FCNN with a twist!

Recurrent Neural Networks

Tanh is used here over ReLU for a number of different reasons:

  • RNNs aren’t going to promote sparsity since there isn’t a meaningful way to map hidden units to input features

  • RNNs have a really serious vanishing gradient problem (more to come)

  • Unlike CNNs, RNNs have a really serious exploding gradient problem!

  • Induce nonlinearities without sparsity in the feature maps for optimization purposes

  • Squish before sparsity here!

We’ll see sigmoids reappear in a little bit!

Recurrent Neural Networks

Note that \(\mathbf x_1\) only relates to \(\mathbf y_1\) through \(\mathbf h_1\)!

Given a hidden state, we say that \(\mathbf y_t\) is determined as:

\[ \underset{(G \times 1)}{\mathbf y_t} = h(\underset{(G \times m)}{\mathbf W_{hy}} \underset{(m \times 1)}{\mathbf h_t} + \underset{(G \times 1)}{\mathbf b_y}) \]

  • A linear combination of a weight matrix and the hidden state (plus a bias vector)

  • \(h()\) is a function that maps the linear predictor to the scale of the actual output (softmax for classes, identity for regression)
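A sketch of the output map with softmax playing the role of \(h()\) (toy sizes assumed):

```python
import numpy as np

def softmax(z):
    """Map a linear predictor to a probability vector."""
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
G, m = 4, 5                            # output size G, hidden size m
W_hy = 0.1 * rng.normal(size=(G, m))   # (G x m)
b_y = np.zeros(G)                      # (G x 1)
h_t = rng.normal(size=m)

# y_t = h(W_hy h_t + b_y); for classes, h() is softmax
y_t = softmax(W_hy @ h_t + b_y)
```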

Recurrent Neural Networks

The part that is so simple that it’s kinda dumb

  • But works amazingly

Use the same \(\mathbf W_{hh}\), \(\mathbf W_{xh}\), and \(\mathbf W_{hy}\) at all time points

Recurrent Neural Networks

Let’s think about how \(\mathbf y_2\) is computed.

  • \(\mathbf y_2\) is a function of \(\mathbf h_2\)

  • \(\mathbf h_2\) is a function of \(\mathbf h_1\) and \(\mathbf x_2\)

  • \(\mathbf h_1\) is a function of \(\mathbf h_0\) and \(\mathbf x_1\)

The second output in the sequence is determined by

  • All previous hidden states

  • All previously seen input vectors!

Recurrent Neural Networks

Let \(\xi()\) be the \(\text{tanh}\) activation function. Dropping bias terms for notational simplicity:

\[\mathbf y_2 = \mathbf W_{hy} \mathbf h_2\]

\[\mathbf h_2 = \xi(\mathbf W_{hh} \mathbf h_1 + \mathbf W_{xh} \mathbf x_2)\]

\[\mathbf h_1 = \xi(\mathbf W_{hh} \mathbf h_0 + \mathbf W_{xh} \mathbf x_1)\]

Stacking it all together:

\[\mathbf y_2 = \mathbf W_{hy} \xi(\mathbf W_{hh} \xi(\mathbf W_{hh} \mathbf h_0 + \mathbf W_{xh} \mathbf x_1) + \mathbf W_{xh} \mathbf x_2)\]

  • Each outcome is a weighted combination of the previous inputs and the initial hidden state (which we typically set or learn separately)

Recurrent Neural Networks

Assuming linear activation functions, we can see how the weights stack:

\[ \mathbf y_2 = \mathbf W_{hy} \mathbf W_{hh} \mathbf W_{hh} \mathbf h_0 + \mathbf W_{hy} \mathbf W_{hh} \mathbf W_{xh} \mathbf x_1 + \mathbf W_{hy} \mathbf W_{xh} \mathbf x_2 \]

\[ \begin{align} \mathbf y_3 = & \mathbf W_{hy} \mathbf W_{xh} \mathbf x_3 + \mathbf W_{hy} \mathbf W_{hh} \mathbf W_{xh} \mathbf x_2 + \mathbf W_{hy} \mathbf W_{hh} \mathbf W_{hh} \mathbf W_{xh} \mathbf x_1 +\\ & \mathbf W_{hy} \mathbf W_{hh} \mathbf W_{hh} \mathbf W_{hh} \mathbf h_0 \end{align} \]

  • If the largest singular value of \(\mathbf W_{hh}\) is less than 1, then the effect of each input diminishes as time goes on!

  • Small times small (e.g. absolute value less than 1) is even smaller.
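This decay can be checked numerically. In the sketch below (toy matrix and an assumed rescaling), \(\mathbf W_{hh}\) is scaled so its largest singular value is 0.9, and the norm of \(\mathbf W_{hh}^t\) shrinks as \(t\) grows:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4
W_hh = rng.normal(size=(m, m))
# Rescale so the largest singular value is 0.9 (< 1)
W_hh = 0.9 * W_hh / np.linalg.svd(W_hh, compute_uv=False)[0]

# Norm of W_hh^t for increasing t: early inputs fade from later outputs
norms = [np.linalg.norm(np.linalg.matrix_power(W_hh, t)) for t in (1, 5, 20)]
```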

Recurrent Neural Networks

The RNN is said to have infinite memory

  • What it saw before will always play a part (albeit with diminishing influence) in what it predicts next!

  • Recursion!

Recurrent Neural Networks

What about training a RNN?

  • Associate each prediction - \(\{\hat{\mathbf y}_1,\hat{\mathbf y}_2,...\}\) - with the actual outcome - \(\{\mathbf y_1,\mathbf y_2,...\}\) - and compute the per-vector loss

  • Then, add ’em together

\[ \mathcal L(\hat{\mathbf Y} , \mathbf Y) = \sum \limits_{t = 1}^{T_y} \mathcal L(\hat{\mathbf y}_t , \mathbf y_t) \]
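A tiny worked example of the summed loss (the predicted probabilities are made-up numbers), using per-step cross-entropy as the per-vector loss:

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Per-step loss for a one-hot target y and predicted probabilities y_hat."""
    return -np.log(y_hat[np.argmax(y)])

# Toy predictions and one-hot targets over T = 2 time steps
Y_hat = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
Y = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]

# The total loss is just the sum of the per-step losses
total_loss = sum(cross_entropy(yh, y) for yh, y in zip(Y_hat, Y))
```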

Recurrent Neural Networks

Many-to-One:

  • \(\mathbf W_{hy}\) is only used for the single output

Recurrent Neural Networks

One-to-Many:

Recurrent Neural Networks

Same deal over and over again!

  • Three common weight matrices that are shared in all appropriate places

RNNs are widely applicable, but not all that different than what we’ve already seen!

Recurrent Neural Networks

Pre-GPT, the seq2seq model was a main approach for question/answer models

Recurrent Neural Networks

This is another example of an encoder/decoder architecture:

  • Use weights to determine the mapping of the inputs to a final hidden state

  • Use that hidden state as the beginning of a new mapping to the outcome

Computationally, not that bad!

A Simple Language Model

Let’s look at an aligned many-to-many model!

Goal: Given input characters, build a model to predict the next character!

A Simple Language Model

For vanilla test-time use, generate new text one character at a time

  • Suppose we just have an “h” and want to see what the RNN will generate

A Simple Language Model

Right now, a basic one-hot encoded character vector

  • What if we want to see something that was not necessarily used to train the model?

Text embeddings are often used to pre-process text (an inherently discrete variable) into a lower dimensional continuous space

  • We’ll discuss this more in the next classes

Same idea as the embeddings you’ve used on PS2!

  • Can be pre-trained
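A minimal sketch of an embedding lookup (vocabulary size and dimension are toy assumptions); a pre-trained matrix would simply replace the random one:

```python
import numpy as np

rng = np.random.default_rng(7)
vocab_size, d = 10, 4                   # toy vocabulary and embedding dimension
E = rng.normal(size=(vocab_size, d))    # embedding matrix (could be pre-trained)

token_ids = np.array([3, 1, 4])         # a short sequence of discrete token ids
embedded = E[token_ids]                 # lookup: ids -> continuous vectors
```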

Recurrent Neural Networks

Can RNNs be deep? Sure!

RNN Gradients

But, there’s a fundamental problem to address first - gradients

  • Given \(\mathbf X\) and \(\mathbf Y\), we need to learn \(\mathbf W_{hh}\), \(\mathbf W_{xh}\), and \(\mathbf W_{hy}\)

  • The hidden states \(\{\mathbf h_1,...,\mathbf h_T\}\) are just a function of these values

Backprop through time:

  • Set initial values for the weights and the initial hidden states

  • Forward pass through time to compute the hidden state values

  • Backwards pass through time to compute the gradients

RNN Gradients

Many-to-One:

RNN Gradients

The gradient for \(\mathbf W_{hh}\):

\[ \frac{\partial \mathcal L}{\partial \mathbf W_{hh}} = \frac{\partial \mathcal L}{\partial \mathbf h_T}\frac{\partial \mathbf h_{T}}{\partial \mathbf h_{T - 1}}\frac{\partial \mathbf h_{T-1}}{\partial \mathbf h_{T - 2}}...\frac{\partial \mathbf h_{1}}{\partial \mathbf W_{hh}} \]

Let’s assume that our activation function is tanh:

\[ \mathbf h_t = \xi(\mathbf W_{hh} \mathbf h_{t - 1} + \mathbf W_{xh} \mathbf x_t) \]

Then:

\[ \frac{\partial \mathbf h_{t}}{\partial \mathbf h_{t - 1}} = \text{tanh}'(\mathbf W_{hh} \mathbf h_{t-1} + \mathbf W_{xh} \mathbf x_t) \mathbf W_{hh} \]

RNN Gradients

The derivative of tanh w.r.t. its input is always between 0 and 1!

The gradient for \(\mathbf W_{hh}\):

\[ \frac{\partial \mathcal L}{\partial \mathbf W_{hh}} = \frac{\partial \mathcal L}{\partial \mathbf h_T}\frac{\partial \mathbf h_{T}}{\partial \mathbf h_{T - 1}}\frac{\partial \mathbf h_{T-1}}{\partial \mathbf h_{T - 2}}...\frac{\partial \mathbf h_{1}}{\partial \mathbf W_{hh}} \]

can be expressed more compactly as:

The gradient for \(\mathbf W_{hh}\):

\[ \frac{\partial \mathcal L}{\partial \mathbf W_{hh}} = \frac{\partial \mathcal L}{\partial \mathbf h_T} \frac{\partial \mathbf h_{1}}{\partial \mathbf W_{hh}} \prod \limits_{t = 2}^T \frac{\partial \mathbf h_t}{\partial \mathbf h_{t - 1}} \]

RNN Gradients

The gradient for \(\mathbf W_{hh}\):

\[ \frac{\partial \mathcal L}{\partial \mathbf W_{hh}} = \frac{\partial \mathcal L}{\partial \mathbf h_T} \frac{\partial \mathbf h_{1}}{\partial \mathbf W_{hh}} \prod \limits_{t = 2}^T \frac{\partial \mathbf h_t}{\partial \mathbf h_{t - 1}} \]

Plugging in what we know:

\[ \frac{\partial \mathcal L}{\partial \mathbf W_{hh}} = \frac{\partial \mathcal L}{\partial \mathbf h_T} \frac{\partial \mathbf h_{1}}{\partial \mathbf W_{hh}} \mathbf W_{hh}^{T - 1} \prod \limits_{t = 2}^T \text{tanh}'(\mathbf W_{hh} \mathbf h_{t-1} + \mathbf W_{xh} \mathbf x_t) \]

  • Small times small equals smaller!
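The shrinkage from the tanh' factors alone can be seen numerically (the pre-activations below are arbitrary illustrative draws):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 50
z = rng.normal(size=T)                 # arbitrary pre-activations, one per step
tanh_grad = 1.0 - np.tanh(z) ** 2      # tanh'(z); each factor lies in (0, 1]

# Running product of the tanh' factors: this is the multiplicative part of
# the gradient, and it collapses toward zero as the sequence gets longer
running = np.cumprod(tanh_grad)
```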

RNN Gradients

The gradient for the recurrence weights is almost surely going to vanish as the number of hidden states gets large!

  • And it doesn’t even take that many

  • Any pre-activation values greater than 3 or less than \(-3\) saturate tanh and wreck the gradient

Double trouble:

  • \(\lim_{t \to \infty} \mathbf W_{hh}^t\) depends on the largest singular value of the matrix

  • If it’s greater than 1, goes to \(\infty\)

  • If it’s less than 1, goes to zero

There’s no escape…

RNN Gradients

A simple solution is called truncated backpropagation

  • Limit the time window over which the loss gradient can propagate

  • Say no more than 10 steps in the past

Problem: We’re potentially giving up on important sequence information if we just say that there is no effect of \(\mathbf h_1\) on \(\mathbf h_{11}\)
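A sketch of the bookkeeping for truncated backpropagation (the chunking scheme and sizes are assumptions): the sequence is split into windows of \(k\) steps, and the hidden state is carried across windows as plain data, so in a real framework gradients would stop at each window boundary:

```python
import numpy as np

def chunk_indices(T, k):
    """Start/stop indices of each truncation window."""
    return [(s, min(s + k, T)) for s in range(0, T, k)]

rng = np.random.default_rng(5)
m, P, T, k = 3, 2, 25, 10
W_hh = 0.1 * rng.normal(size=(m, m))
W_xh = 0.1 * rng.normal(size=(m, P))
X = rng.normal(size=(T, P))

h = np.zeros(m)
for start, stop in chunk_indices(T, k):
    # In a real framework, h would be detached here so gradients from this
    # window cannot flow back past the window boundary
    for t in range(start, stop):
        h = np.tanh(W_hh @ h + W_xh @ X[t])
```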

LSTM

A clever solution - The Long Short Term Memory model (LSTM)

LSTM

There’s a lot going on here. Let’s take it step-by-step.

Remember that the next state at each step of an RNN is determined as:

\[ \underset{(m \times 1)}{\mathbf h_t} = \text{tanh}(\underset{(m \times m)}{\mathbf W_{hh}} \underset{(m \times 1)}{\mathbf h_{t-1}} + \underset{(m \times P)}{\mathbf W_{xh}} \underset{(P \times 1)}{\mathbf x_t} + \underset{(m \times 1}{\mathbf b_h}) \]

Running example:

  • Language model trying to predict the next word based on all the previous ones

LSTM

LSTMs alter the RNN model by having two concurrently running states - the hidden state and the cell state

LSTM

The cell state runs straight down the chain with only some minor linear interactions

  • e.g. the gradient is really easy to compute and doesn’t change exponentially!

The cell state is of the same size as the hidden value!

  • It’s just a carefully collected aggregation of the RNN hidden values
  • A smooth version of the hidden state

LSTM

The hidden state can add or remove information from the cell state through gates

A sigmoid layer (which maps values to between 0 and 1) determines when and how much the hidden state should influence the next one!

LSTMs have 3 of these gates to protect and control the cell state.

LSTM

The first gate is called the forget gate

  • Each gate is associated with its own weights (shared across all hidden values) that control its action

LSTM

The forget gate controls how we lose information from the previous memory

  • In a language model, maybe the hidden state encodes the gender of the subject.

  • Maybe the subject changes and uses different pronouns?

  • It’s not worthwhile to remember the gender state in this case.

The sigmoid activation is applied to the weighted hidden state element-wise

  • Some parts of the vector might be more useful to remember than others

LSTM

The second gate is called the input gate

LSTM

Two weight matrices - one for the input gate and one for the candidate state

  • A candidate hidden vector is created using the standard update with a tanh activation, \(\tilde{\mathbf C}_t\)

  • The input gate applies a value between zero and 1 to each element of the candidate gate that says how much of the new information should be combined with the old info to create the new hidden state

  • If we’re changing subjects and pronouns, we want a new state that corresponds to the new subject!

LSTM

The new cell state is a linear combination of the old cell state and the candidate cell state, weighted by the forget gate and the input (new info) gate

LSTM

Finally, we need to map the cell state to the next hidden value

LSTM

The output gate creates a filtered version of the cell state

  • Note that the cell state is a function of the hidden values, just attenuated depending on the amount of information that needs to flow through to maximize the predictive power

First, there is a sigmoid activation which determines which parts of the old state are going to combine with the cell state to get the new state

Then, we push the cell state through a tanh activation and multiply it by the output gate so that we only keep the parts of the cell state that are useful for the next prediction!
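Putting the three gates together, a single LSTM step can be sketched in NumPy (the stacked weight layout and toy sizes are assumptions, not a fixed convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev; x_t] to the four gate pre-activations."""
    m = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0*m:1*m])            # forget gate: what to drop from c_prev
    i = sigmoid(z[1*m:2*m])            # input gate: how much new info to admit
    c_tilde = np.tanh(z[2*m:3*m])      # candidate cell state
    o = sigmoid(z[3*m:4*m])            # output gate: what to expose as h_t
    c_t = f * c_prev + i * c_tilde     # old memory (partly forgotten) + new info
    h_t = o * np.tanh(c_t)             # filtered view of the cell state
    return h_t, c_t

rng = np.random.default_rng(6)
m, P = 4, 3
W = 0.1 * rng.normal(size=(4 * m, m + P))
b = np.zeros(4 * m)
h, c = np.zeros(m), np.zeros(m)
h, c = lstm_step(rng.normal(size=P), h, c, W, b)
```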

LSTM

The earlier gates deal with changes to the state, but the output gate determines what parts of the state are passed along to the next prediction

  • Perhaps the last word was a subject (pronoun change), so the next word is likely to be a verb and the part-of-speech portion of the hidden value should switch to suggesting a verb!

LSTM

What does this accomplish in terms of gradients?

  • Note that \(\mathbf h_{t-1}\) connects to \(\mathbf h_t\) in the regular way (just a few more sets of weights)

  • \(\mathbf C_t\) only connects to \(\mathbf C_{t-1}\) through \(\mathbf f_t\)

  • No direct interaction with \(\mathbf h\) or the weights!

We can think of the cell state as a smooth version of the hidden state that is updated with different small perturbations at each time period

\[ \frac{\partial \mathbf C_t}{\partial \mathbf C_{t - 1}} = \mathbf f_t \]

  • Easy to deal with and unlikely to shrink to zero as it is not multiplicative!

LSTM

LSTMs are the default - not a special case

  • Most common RNN architectures are going to use LSTM architecture as they are actually much easier to train and produce the same results

We’re trading some expressiveness of the hidden state for ease of training

  • Changes are just going to be much more smooth than the vanilla RNN

  • Most of the time, this is closer to optimal than not

LSTM

Let’s look at some examples.

Recurrent Neural Networks

RNNs (at least 5 years ago) were the main tool for dealing with sequence problems

  • A new framework has emerged - multiheaded self attention

  • The famous transformer architecture

  • We’ll start discussing this next class.